Introduction

Overview and Motivation

Over the last two years, COVID-19 has dominated the news. Yet although around 3% of the world population has had COVID-19, diabetes can be considered an even bigger health problem. According to the International Diabetes Federation (IDF), 463 million adults were living with diabetes in 2019 (around 6-7% of the world population), and this number is forecast to rise to 700 million by 2045. Furthermore, about 90% of diabetes cases are type 2, which results mainly from lifestyle rather than genetics. Both types can nevertheless be treated, and type 2 can often be prevented, with a healthier diet and more physical activity. Additionally, according to the WHO, low-income countries tend to have a higher diabetes prevalence. Living in Europe, we observed that diabetes rates differ a lot from one country to another, so we wanted to find out whether these rates are indeed linked to a country’s income, whether the nutritional composition of the diet in richer countries differs from that in poorer ones and, if so, how.

Research questions

We would therefore like to answer the following questions:

  1. Do European countries that have higher GDPs really have lower diabetes prevalence?

  2. Do European countries that have higher GDPs consume fewer calories?

  3. How do the proportions of macronutrients (animal protein/plant protein/fat/carbohydrates) consumed differ between richer and poorer countries?

  4. How do these differences relate to the diabetes prevalence in these countries?

  5. What is the typical diet observed in richer countries that relates to lower diabetes prevalence?

Data

To answer our research questions, we used four different datasets (caloric supply, GDP, population and diabetes prevalence). While searching for datasets, we made sure that the years and countries matched across all of them.

Wrangling and cleaning

Caloric consumption

The first dataset we used, downloaded from https://ourworldindata.org/diet-compositions, contains the supply of macronutrients, in calories, for different countries. We used food supply rather than food consumption data because the latter is harder to find and because supply generally reflects the population’s demand, and therefore its consumption. The dataset describes the average nutrition of different countries from 1961 to 2013:

  • It is composed of 8981 observations of 7 variables:

    • Entity: name of the country
    • Code: ISO country code
    • Year: year of the observation
    • Calories from animal protein (FAO (2017)): the average per capita supply of calories derived from animal protein, measured in kilocalories per person per day
    • Calories from plant protein (FAO (2017)): the average per capita supply of calories derived from plant protein, measured in kilocalories per person per day
    • Calories from fat (FAO (2017)): the average per capita supply of calories derived from fat, measured in kilocalories per person per day
    • Calories from carbohydrates (FAO (2017)): the average per capita supply of calories derived from carbohydrates, measured in kilocalories per person per day

The intake of specific macronutrients (carbohydrates, protein and fats) is derived from average food composition factors, which are presented in the Food and Agriculture Organization’s (FAO) Food Balance Sheet Handbook (https://www.fao.org/faostat/en/#data).

We will only focus on observations of European countries in the 2000s.

We identified countries by their ISO code, which is standardized worldwide, because country names can be spelled differently from one table to another.

We then computed, for each country, the mean supply of each macronutrient between 2000 and 2013, and summed these means to obtain the total calories per person per day. We need this total for our second research question, and it also gives a broader view of overall consumption.

To make sure that the later joins go smoothly, we also removed duplicates and the country name column.
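The averaging and summing steps can be sketched in base R as follows. This is only an illustration on invented toy values, not the code used for the report; the column names follow the variables listed further down, and the data frame names diet and calories are assumptions.

```r
# Toy stand-in for the Our World in Data table; all values are invented.
diet <- data.frame(
  Code = c("AUT", "AUT", "AUT", "BGR", "BGR", "BGR"),
  Year = c(1999, 2000, 2013, 1999, 2000, 2013),
  cal_prot_animal = c(240, 244, 246, 150, 154, 156),
  cal_prot_plant  = c(170, 168, 170, 166, 167, 169),
  cal_carbs       = c(1800, 1830, 1836, 1600, 1604, 1608),
  cal_fat         = c(1400, 1450, 1458, 800, 840, 852)
)

# Drop exact duplicate rows, then keep the study window only.
diet <- unique(diet)
diet_2000s <- subset(diet, Year >= 2000 & Year <= 2013)

# Mean of each macronutrient per country.
macro_cols <- c("cal_prot_animal", "cal_prot_plant", "cal_carbs", "cal_fat")
calories <- aggregate(diet_2000s[macro_cols],
                      by = list(country_code = diet_2000s$Code),
                      FUN = mean)

# Total consumption is the sum of the four means.
calories$total_consumption <- rowSums(calories[macro_cols])
print(calories)
```

Summing the means rather than averaging yearly totals gives the same result here, because every country has one value per macronutrient and year.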

Our assumption was that a country’s wealth may fluctuate over the period (e.g. a dip during the economic crisis of 2008), but that an overall mean is sufficient to compare the wealth of different countries.

  • We now have a dataframe with the following variables:

    • country_code: ISO country code
    • cal_prot_animal: the mean calories from animal protein consumed per person in each country over 2000-2013
    • cal_prot_plant: the mean calories from plant protein consumed per person in each country over 2000-2013
    • cal_carbs: the mean calories from carbohydrates consumed per person in each country over 2000-2013
    • cal_fat: the mean calories from fat consumed per person in each country over 2000-2013
    • total_consumption: the total calorie consumption per person, i.e. the sum of the mean consumption of each macronutrient in each country over 2000-2013
The final table is the following:
Table 1: Caloric Consumption
Country Code Calories from animal protein Calories from plant protein Calories from carbohydrates Calories from fat Total consumption
AUT 245 169 1833 1454 3702
BEL 238 158 1856 1467 3719
BGR 155 168 1606 846 2775
HRV 168 148 1691 940 2946
CYP 197 127 1291 1019 2633
CZE 218 153 1728 1155 3254
DNK 273 157 1746 1190 3366
EST 212 167 1967 842 3188
FIN 269 166 1623 1177 3234
FRA 293 161 1611 1480 3545
DEU 240 159 1785 1276 3460
GRC 250 204 1744 1338 3536
HUN 193 153 1560 1221 3126
IRL 279 174 1950 1187 3590
ITA 241 204 1776 1390 3612
LVA 203 154 1687 1044 3087
LTU 274 198 2055 858 3385
LUX 288 148 1752 1318 3507
MLT 237 204 1932 994 3367
NLD 292 136 1599 1195 3222
POL 204 197 1969 1035 3405
PRT 275 177 1824 1240 3516
ROU 200 220 2003 916 3340
SVK 140 150 1610 952 2853
SVN 230 168 1664 1067 3129
ESP 279 159 1481 1323 3243
SWE 285 143 1566 1137 3131
CHE 237 138 1660 1392 3426
GBR 232 178 1748 1256 3414

GDP

Our second dataset, downloaded from the portal https://data.worldbank.org, gives us information about the GDP of many countries over the course of 60 years (1960-2020).

  • It is composed of 266 observations of 65 variables:

    • Country Name: name of the country
    • Country Code: ISO country code
    • Indicator Name: equal to “GDP in current US$” for every row
    • Indicator Code: equal to “NY.GDP.MKTP.CD” for every row
    • One variable for each year from 1960 to 2020

RStudio imported the Excel file as is, so the column names ended up in the 3rd row and columns 3 to 65 were given numbers as names.

We fixed this and kept only the years of interest that this table shares with the others, namely 2000-2013. We also dropped the Indicator Name and Indicator Code variables, since their values are identical in every row and provide no additional information.

Next, we kept only the European countries, just like in the first table.

To make joins easier, we gathered the columns corresponding to the different years into a single “year” column, so that each row holds the GDP of one country in one year.
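A minimal base R sketch of this wide-to-long reshaping, together with the type conversion, rescaling and averaging described in the next paragraphs, is given here. The GDP figures are invented, and the names gdp_wide, gdp_long and avg are assumptions made for the illustration.

```r
# Toy wide table: one column per year, values stored as character
# (as they were after the import). All figures are invented.
gdp_wide <- data.frame(
  country_name = c("Austria", "Belgium"),
  country_code = c("AUT", "BEL"),
  `2000` = c("1.97e11", "2.37e11"),
  `2001` = c("1.98e11", "2.38e11"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

year_cols <- c("2000", "2001")

# One row per country and year: stack the year columns on top of each other.
gdp_long <- data.frame(
  country_code = rep(gdp_wide$country_code, times = length(year_cols)),
  year = rep(as.integer(year_cols), each = nrow(gdp_wide)),
  gdp = unlist(gdp_wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Convert GDP from character to numeric and express it in billions.
gdp_long$gdp <- as.numeric(gdp_long$gdp) / 1e9

# Average GDP per country over the selected years.
avg <- aggregate(gdp ~ country_code, data = gdp_long, FUN = mean)
names(avg)[names(avg) == "gdp"] <- "avg_gdp"
print(avg)
```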

To make the data easier to manipulate, we renamed the variables of this table as well. We also made sure that GDP was stored as a numeric variable rather than as character, as it was by default. To keep the graphs of the exploratory data analysis readable, we also divided the avg_gdp column by one billion.

Lastly, we computed the average GDP for each country over the years 2000-2013 in order to be able to plot different variables together.

Table 2: GDP
Country Name Country Code Average GDP (in billion $)
Austria AUT 335.98
Belgium BEL 406.97
Bulgaria BGR 37.41
Croatia HRV 48.12
Cyprus CYP 20.15
Czech Republic CZE 158.48
Denmark DNK 275.37
Estonia EST 16.47
Finland FIN 216.58
France FRA 2283.48
Germany DEU 3003.51
Greece GRC 246.19
Hungary HUN 110.92
Ireland IRL 203.04
Italy ITA 1872.80
Latvia LVA 21.04
Lithuania LTU 30.77
Luxembourg LUX 42.85
Malta MLT 7.27
Netherlands NLD 721.70
Poland POL 365.43
Portugal PRT 200.37
Romania ROU 125.08
Slovak Republic SVK 70.82
Slovenia SVN 39.50
Spain ESP 1176.91
Sweden SWE 425.83
Switzerland CHE 490.18
United Kingdom GBR 2416.76

We now have a dataframe with the following variables:

  • country_name: name of the country
  • country_code: ISO code of the country
  • avg_gdp: the average GDP of the country over 2000-2013, in billion US$

Population

Since we will be observing the relation between GDP and the calories consumed per person, it is useful to have the GDP per person as well. We therefore imported the dataset from https://data.worldbank.org/indicator/SP.POP.TOTL, which gives the population of each country from 1960 to 2020.

  • It is composed of 266 observations of 65 variables:

    • Country Name: name of the country
    • Country Code: ISO country code
    • Indicator Name: equal to “Population, total” for every row
    • Indicator Code: equal to “SP.POP.TOTL” for every row
    • One variable for each year from 1960 to 2020

As this dataset comes from the same source and has the same file type as the GDP dataset, we applied the same wrangling.

To analyze the link between GDP per person and calorie consumption per person, we created a separate table, which we join to the final clean dataset.
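One possible way to build this table in base R is sketched below. It assumes that GDP per person is computed year by year and then averaged, which is one of several reasonable definitions and is not confirmed by the text; all figures and the names gdp, pop, merged and gdp_pp are invented for the example.

```r
# Toy long tables (one row per country and year); all values are invented.
gdp <- data.frame(
  country_code = c("AUT", "AUT", "BEL", "BEL"),
  year = c(2000, 2001, 2000, 2001),
  gdp = c(2.0e11, 2.4e11, 2.5e11, 3.5e11)    # current US$
)
pop <- data.frame(
  country_code = c("AUT", "AUT", "BEL", "BEL"),
  year = c(2000, 2001, 2000, 2001),
  population = c(8e6, 8e6, 1e7, 1e7)
)

# Match each GDP value with the population of the same country and year.
merged <- merge(gdp, pop, by = c("country_code", "year"))

# Assumption: yearly GDP per person, averaged over the period afterwards.
merged$gdp_per_person <- merged$gdp / merged$population
gdp_pp <- aggregate(gdp_per_person ~ country_code, data = merged, FUN = mean)
print(gdp_pp)
```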

We now have a dataframe with the following variables:

  • country_name: name of the country
  • country_code: ISO code of the country
  • gdp_per_person: the average GDP per person of the country over 2000-2013

Diabetes

The last dataset, downloaded from https://www.ncdrisc.org/data-downloads-diabetes.html, gives the age-standardised diabetes prevalence for each country and sex from 1980 to 2014.

  • It is composed of 14,000 observations of 7 variables:

    • Country/Region/World: name of the country
    • ISO: ISO country code
    • Sex: sex for which the diabetes prevalence is measured in a given country and year
    • Year: year of observation (1980-2014)
    • Age-standardised diabetes prevalence: diabetes rate, standardised for age
    • Lower 95% uncertainty interval: lower limit of the uncertainty interval for the diabetes rate
    • Upper 95% uncertainty interval: upper limit of the uncertainty interval for the diabetes rate

As with the previous datasets, we kept only European countries and observations between 2000 and 2013.

We also decided not to use the two 95% uncertainty interval variables.

Then, we separated our dataset into two subsets: one with data about men and another with data about women.

We then renamed the variables of these 2 subsets to facilitate joining tables later on.

Finally, we grouped observations by country to get the mean diabetes prevalence between 2000 and 2013 for each European country:

  • For men :
Table 3: Diabetes men
Country Code Diabetes rate
AUT 0.053
BEL 0.057
BGR 0.073
CHE 0.050
CYP 0.077
CZE 0.078
DEU 0.059
DNK 0.055
ESP 0.084
EST 0.071
FIN 0.066
FRA 0.071
GBR 0.063
GRC 0.069
HRV 0.071
HUN 0.080
IRL 0.069
ITA 0.065
LTU 0.078
LUX 0.068
LVA 0.071
MLT 0.088
NLD 0.052
POL 0.074
PRT 0.075
ROU 0.062
SVK 0.072
SVN 0.066
SWE 0.058
  • For women :
Table 4: Diabetes women
Country Code Diabetes rate
AUT 0.034
BEL 0.039
BGR 0.064
CHE 0.030
CYP 0.056
CZE 0.065
DEU 0.040
DNK 0.035
ESP 0.059
EST 0.064
FIN 0.044
FRA 0.044
GBR 0.049
GRC 0.060
HRV 0.059
HUN 0.063
IRL 0.049
ITA 0.047
LTU 0.069
LUX 0.039
LVA 0.065
MLT 0.066
NLD 0.037
POL 0.066
PRT 0.052
ROU 0.059
SVK 0.059
SVN 0.065
SWE 0.040
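The split by sex and the averaging per country described above can be sketched in base R as follows. The prevalence values are invented, and the column name prevalence as well as the labels "Men" and "Women" are assumptions about how the raw file codes these variables.

```r
# Toy stand-in for the filtered diabetes data; values are invented.
diab <- data.frame(
  ISO = c("AUT", "AUT", "AUT", "AUT", "BEL", "BEL", "BEL", "BEL"),
  Sex = rep(c("Men", "Women"), times = 4),
  Year = c(2000, 2000, 2001, 2001, 2000, 2000, 2001, 2001),
  prevalence = c(0.052, 0.035, 0.054, 0.033, 0.058, 0.040, 0.056, 0.038)
)

# Average prevalence per country for one sex, with join-friendly names.
mean_by_country <- function(data, sex, value_name) {
  one_sex <- subset(data, Sex == sex)
  out <- aggregate(prevalence ~ ISO, data = one_sex, FUN = mean)
  names(out) <- c("country_code", value_name)
  out
}

diabetes_men   <- mean_by_country(diab, "Men", "prop_men_diabetes")
diabetes_women <- mean_by_country(diab, "Women", "prop_women_diabetes")
print(diabetes_men)
print(diabetes_women)
```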

We now have 2 dataframes with the following variables:

  • country_code: ISO code of the country
  • prop_men_diabetes or prop_women_diabetes: the average diabetes rate in the country over 2000-2013

Joining tables

For the last step of our tidying, we joined all four tables into one dataset on the country_code key:

Final dataset: GDP, diabetes and calories
Country Name Country Code Average GDP (in billion $) GDP per person (in $) Men Diabetes Women Diabetes Calories from animal protein Calories from plant protein Calories from carbohydrates Calories from fat Total consumption
Austria AUT 335.98 40545 0.053 0.034 245 169 1833 1454 3702
Belgium BEL 406.97 38015 0.057 0.039 238 158 1856 1467 3719
Bulgaria BGR 37.41 4988 0.073 0.064 155 168 1606 846 2775
Croatia HRV 48.12 11189 0.071 0.059 168 148 1691 940 2946
Cyprus CYP 20.15 18890 0.077 0.056 197 127 1291 1019 2633
Czech Republic CZE 158.48 15283 0.078 0.065 218 153 1728 1155 3254
Denmark DNK 275.37 50213 0.055 0.035 273 157 1746 1190 3366
Estonia EST 16.47 12283 0.071 0.064 212 167 1967 842 3188
Finland FIN 216.58 40806 0.066 0.044 269 166 1623 1177 3234
France FRA 2283.48 35700 0.071 0.044 293 161 1611 1480 3545
Germany DEU 3003.51 36733 0.059 0.040 240 159 1785 1276 3460
Greece GRC 246.19 22344 0.069 0.060 250 204 1744 1338 3536
Hungary HUN 110.92 11050 0.080 0.063 193 153 1560 1221 3126
Ireland IRL 203.04 46899 0.069 0.049 279 174 1950 1187 3590
Italy ITA 1872.80 32000 0.065 0.047 241 204 1776 1390 3612
Latvia LVA 21.04 9774 0.071 0.065 203 154 1687 1044 3087
Lithuania LTU 30.77 9697 0.078 0.069 274 198 2055 858 3385
Luxembourg LUX 42.85 87546 0.068 0.039 288 148 1752 1318 3507
Malta MLT 7.27 17759 0.088 0.066 237 204 1932 994 3367
Netherlands NLD 721.70 43883 0.052 0.037 292 136 1599 1195 3222
Poland POL 365.43 9586 0.074 0.066 204 197 1969 1035 3405
Portugal PRT 200.37 19077 0.075 0.052 275 177 1824 1240 3516
Romania ROU 125.08 6062 0.062 0.059 200 220 2003 916 3340
Slovak Republic SVK 70.82 13146 0.072 0.059 140 150 1610 952 2853
Slovenia SVN 39.50 19502 0.066 0.065 230 168 1664 1067 3129
Spain ESP 1176.91 26263 0.084 0.059 279 159 1481 1323 3243
Sweden SWE 425.83 46181 0.058 0.040 285 143 1566 1137 3131
Switzerland CHE 490.18 64036 0.050 0.030 237 138 1660 1392 3426
United Kingdom GBR 2416.76 39338 0.063 0.049 232 178 1748 1256 3414
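A join of this kind can be sketched in base R with merge() applied successively through Reduce(). The three countries and their values are taken from the tables above, but Bulgaria is deliberately left out of the toy calories table to show that the default inner join silently drops countries missing from any table; the data frame names are assumptions.

```r
# Small extracts of the cleaned tables (values copied from the tables above).
gdp <- data.frame(country_code = c("AUT", "BEL", "BGR"),
                  avg_gdp = c(335.98, 406.97, 37.41))
men <- data.frame(country_code = c("AUT", "BEL", "BGR"),
                  prop_men_diabetes = c(0.053, 0.057, 0.073))
women <- data.frame(country_code = c("AUT", "BEL", "BGR"),
                    prop_women_diabetes = c(0.034, 0.039, 0.064))
# BGR is omitted here on purpose, to illustrate the behaviour of the join.
calories <- data.frame(country_code = c("AUT", "BEL"),
                       total_consumption = c(3702, 3719))

# Join the four tables one after the other on the country_code key.
tables <- list(gdp, men, women, calories)
final <- Reduce(function(x, y) merge(x, y, by = "country_code"), tables)

# Countries lost in the join, worth checking before any analysis.
dropped <- setdiff(gdp$country_code, final$country_code)
print(final)
print(dropped)
```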

Missing values

We did not have any NA values in our tables. We think this is because we took the time to gather quality data that matched in terms of dates and countries.
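A check of this kind can be written in a few lines of base R. In the sketch below, one missing value is inserted on purpose so that the check has something to find; the data frame name final and its columns are assumptions.

```r
# Toy joined table with one missing value inserted on purpose.
final <- data.frame(
  country_code = c("AUT", "BEL", "BGR"),
  avg_gdp = c(335.98, 406.97, NA),
  total_consumption = c(3702, 3719, 2775)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(final))
print(na_per_column)

# TRUE as soon as at least one value is missing anywhere in the table.
has_na <- anyNA(final)

# Rows that contain at least one missing value.
incomplete_rows <- final[!complete.cases(final), ]
print(incomplete_rows)
```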

Exploratory data analysis

First, even though we use the means of the variables to answer our questions, it is interesting to observe how they evolve in each country over time. We start with the GDP.

Evolution of GDP per country

We can see that the GDP of France, Germany, Italy, Spain and the United Kingdom had a significant increase between 2000 and 2008.

Plotting GDP against Diabetes (Men & Women)

Now let’s see if there is a relation between the GDP of a country and its diabetes prevalence (men = blue, women = red).

Apart from 5 outliers, our observations are mostly bunched up at the left of the graph. We decided to exclude these 5 observations to see whether a trend emerges among the other countries. These outliers are the five countries noted above whose GDP increased markedly over the period (France, Germany, Italy, Spain and the United Kingdom).

Without the outliers, the picture is clearer: the richer a country is, the lower its diabetes rate tends to be.
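Excluding the five outliers amounts to a simple filter on the country code, sketched here in base R with the average GDP values of Table 2. The threshold of 1000 billion $ used in the cross-check is only an assumption made for this illustration: it happens to isolate exactly these five countries but is not taken from the analysis.

```r
# Average GDP in billion $ per country, copied from Table 2.
gdp <- data.frame(
  country_code = c("AUT", "BEL", "BGR", "HRV", "CYP", "CZE", "DNK", "EST",
                   "FIN", "FRA", "DEU", "GRC", "HUN", "IRL", "ITA", "LVA",
                   "LTU", "LUX", "MLT", "NLD", "POL", "PRT", "ROU", "SVK",
                   "SVN", "ESP", "SWE", "CHE", "GBR"),
  avg_gdp = c(335.98, 406.97, 37.41, 48.12, 20.15, 158.48, 275.37, 16.47,
              216.58, 2283.48, 3003.51, 246.19, 110.92, 203.04, 1872.80,
              21.04, 30.77, 42.85, 7.27, 721.70, 365.43, 200.37, 125.08,
              70.82, 39.50, 1176.91, 425.83, 490.18, 2416.76)
)

# The five countries treated as outliers in the text.
outliers <- c("FRA", "DEU", "ITA", "ESP", "GBR")

# Keep every country that is not in the outlier list.
gdp_no_outliers <- subset(gdp, !(country_code %in% outliers))

# Cross-check with a hypothetical threshold on average GDP.
above_threshold <- gdp$country_code[gdp$avg_gdp > 1000]
print(nrow(gdp_no_outliers))
print(above_threshold)
```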

Evolution of the consumption of macronutrients in calories per country

For the second table, we again tried to see whether there was a trend in the consumption of the different macronutrients in the 2000s for each country in our sample.

One difference between countries stands out and seems to be related to wealth: countries with a higher GDP, like Austria, consume on average more fat, as can be seen on this graph:

Countries with a lower GDP, like Bulgaria, have a lower fat consumption, as seen below:

There do not seem to be any trends over time in the graphs above, and diets seem rather stable in each country, which is why we take the average consumption of each macronutrient for our analysis. We can however note that the 5 outliers mentioned before tend to have a higher fat consumption than the countries with a smaller GDP.

Plotting GDP against calories consumed

We then wanted to analyse the relation between a country’s GDP and its consumption of each macronutrient, as well as its total calorie consumption, to see if there is a trend (total calories = orange, fat = blue, carbohydrates = purple, animal protein = red, plant protein = green).

We see that the calorie consumption does not really change with GDP. To look more closely at the relation between total calorie consumption and GDP, and to see whether the same outliers appear, we created further plots.

We end up again with the same 5 outliers, which have a much higher GDP than the other countries. If we remove them, we obtain the following plots:

Now the trend is easier to see: the higher a country’s GDP, the higher the total calories consumed, contrary to our hypothesis.

Evolution of Diabetes per country

Once again, we tried to see if the diabetes prevalence in each country changed over the years 2000-2013.

We saw right away that the prevalence of diabetes is higher for men than for women across all countries (with two exceptions: Romania between 2000 and 2003 and Slovenia between 2000 and 2006).

We observed three different scenarios among the countries. A decrease of diabetes over time for:

  • Belgium
  • Denmark
  • Finland

We take Belgium as an example :

A decrease over time for women but not for men for :

  • Austria
  • Malta
  • Netherlands
  • Germany
  • Italy
  • Spain
  • Switzerland

We take Austria as an example :

In other European countries, the prevalence of diabetes is increasing (at different paces) over time.

We take Croatia as an example :

Plotting Diabetes against each type of macronutrient consumption

Finally, we want to plot the diabetes prevalence against the total calorie consumption as well as against each type of macronutrient consumed.

We can see a negative trend for the total consumption, the calories from animal protein and the calories from fat, and a positive trend for the calories from plant protein. For the calories from carbohydrates, we can see a slightly positive trend for women.

Plotting Diabetes against each type of macronutrient consumption (without outliers)

Since the 5 outliers affected our plots involving the GDP variable so much, we now want to see whether the trends differ once we remove them.

Without our 5 outliers, the trends barely change for each type of calories consumed, except for carbohydrates, where the trend for men becomes slightly positive.

Analysis

1. Do European countries that have higher GDPs really have lower diabetes prevalence?

This first question serves more as a control, since we learned during our research prior to our project that countries with higher GDPs tend to have lower diabetes rates. Indeed, we can observe that in the EDA.

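It is important to note that, when we fit a linear model on these variables and compute correlations over all observations, the relationships are weak: the correlation is -0.369 for women and -0.236 for men, and the regression coefficient is only borderline significant for women (p = 0.049) and not significant for men (p = 0.217).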

#> [1] -0.369
#> [1] -0.236
  avg_gdp
Predictors Estimates std. Error Statistic p
(Intercept) 1847.35 654.70 2.82 0.009
prop women diabetes -25161.76 12203.46 -2.06 0.049
Observations 29
R2 / R2 adjusted 0.136 / 0.104


  avg_gdp
Predictors Estimates std. Error Statistic p
(Intercept) 1890.37 1086.48 1.74 0.093
prop men diabetes -19988.04 15812.10 -1.26 0.217
Observations 29
R2 / R2 adjusted 0.056 / 0.021
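The correlation and the simple linear model for women can be approximated in base R from the final dataset printed earlier. Because the table values are rounded, the results may differ slightly from the figures reported above; the data frame name gdp_diabetes is an assumption.

```r
# Average GDP (billion $) and women's diabetes rate, in the row order of
# the final dataset (Austria first, United Kingdom last).
gdp_diabetes <- data.frame(
  avg_gdp = c(335.98, 406.97, 37.41, 48.12, 20.15, 158.48, 275.37, 16.47,
              216.58, 2283.48, 3003.51, 246.19, 110.92, 203.04, 1872.80,
              21.04, 30.77, 42.85, 7.27, 721.70, 365.43, 200.37, 125.08,
              70.82, 39.50, 1176.91, 425.83, 490.18, 2416.76),
  prop_women_diabetes = c(0.034, 0.039, 0.064, 0.059, 0.056, 0.065, 0.035,
                          0.064, 0.044, 0.044, 0.040, 0.060, 0.063, 0.049,
                          0.047, 0.065, 0.069, 0.039, 0.066, 0.037, 0.066,
                          0.052, 0.059, 0.059, 0.065, 0.059, 0.040, 0.030,
                          0.049)
)

# Pearson correlation between the two variables.
r_all <- cor(gdp_diabetes$avg_gdp, gdp_diabetes$prop_women_diabetes)

# Simple linear regression with average GDP as the response, as above.
fit_all <- lm(avg_gdp ~ prop_women_diabetes, data = gdp_diabetes)

print(r_all)
print(coef(summary(fit_all)))      # estimates, std. errors, t and p values
print(summary(fit_all)$r.squared)  # equals r_all^2 in a one-predictor model
```

In a regression with a single predictor, R2 is the square of the correlation, which is how each correlation above can be matched to its model (0.369 squared is about 0.136).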

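However, once we exclude the outliers, the relationship becomes much stronger: the correlations are -0.696 for women and -0.739 for men, and both regression coefficients are highly significant (p < 0.001).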

#> [1] -0.696
#> [1] -0.739
  avg_gdp
Predictors Estimates std. Error Statistic p
(Intercept) 741.51 123.89 5.99 <0.001
prop women diabetes -10299.65 2263.83 -4.55 <0.001
Observations 24
R2 / R2 adjusted 0.485 / 0.461


  avg_gdp
Predictors Estimates std. Error Statistic p
(Intercept) 1147.62 187.48 6.12 <0.001
prop men diabetes -14048.33 2730.12 -5.15 <0.001
Observations 24
R2 / R2 adjusted 0.546 / 0.526

In the EDA section, we grouped countries into 3 categories according to how their diabetes rate evolved over time. Here, we confirmed statistically that, once the outliers are removed, a relationship exists between GDP and the diabetes rate. This made us think that the countries could be grouped into clusters.

To determine the number of clusters, we used the elbow method. This method examines the percentage of variance explained as a function of the number of clusters. It is based on the idea that one should choose a number of clusters such that adding another cluster does not model the data much better. The percentage of variance explained by the clusters is plotted against the number of clusters, and the "elbow" of the curve indicates the number to retain.
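The elbow method can be illustrated with k-means in base R, as in the sketch below. It uses invented two-dimensional data with three well-separated groups rather than our country data, and k-means itself is an assumption here, since the clustering algorithm is not named in the text.

```r
set.seed(42)

# Invented data: three well-separated groups of 20 points in two dimensions.
x <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)
x <- scale(x)  # put all variables on the same scale before clustering

# Total within-cluster sum of squares for k = 1, ..., 6 clusters.
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Share of variance explained by the clusters (0 for a single cluster).
explained <- 1 - wss / wss[1]

# The elbow is the point after which adding clusters gains little.
plot(ks, explained, type = "b",
     xlab = "Number of clusters", ylab = "Share of variance explained")
print(round(explained, 3))
```

With data like these, the curve rises steeply up to three clusters and then flattens, which is the pattern one looks for in the graph.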


We therefore see from the graph above that the optimal number of clusters is 3. The allocation of countries according to their cluster is therefore as follows:

Table: Cluster 1
avg_gdp gdp_per_person prop_men_diabetes prop_women_diabetes cal_prot_animal cal_prot_plant cal_carbs cal_fat total_consumption cluster
AUT 336 40545 0.053 0.034 245 169 1833 1454 3702 1
BEL 407 38015 0.057 0.039 238 158 1856 1467 3719 1
DNK 275 50213 0.055 0.035 273 157 1746 1190 3366 1
FIN 217 40806 0.066 0.044 269 166 1623 1177 3234 1
FRA 2283 35700 0.071 0.044 293 161 1611 1480 3545 1
DEU 3004 36733 0.059 0.040 240 159 1785 1276 3460 1
IRL 203 46899 0.069 0.049 279 174 1950 1187 3590 1
ITA 1873 32000 0.065 0.047 241 204 1776 1390 3612 1
NLD 722 43883 0.052 0.037 292 136 1599 1195 3222 1
SWE 426 46181 0.058 0.040 285 143 1566 1137 3131 1
GBR 2417 39338 0.063 0.049 232 178 1748 1256 3414 1
Table: Cluster 2
avg_gdp gdp_per_person prop_men_diabetes prop_women_diabetes cal_prot_animal cal_prot_plant cal_carbs cal_fat total_consumption cluster
BGR 37.41 4988 0.073 0.064 155 168 1606 846 2775 2
HRV 48.12 11189 0.071 0.059 168 148 1691 940 2946 2
CYP 20.15 18890 0.077 0.056 197 127 1291 1019 2633 2
CZE 158.48 15283 0.078 0.065 218 153 1728 1155 3254 2
EST 16.47 12283 0.071 0.064 212 167 1967 842 3188 2
GRC 246.19 22344 0.069 0.060 250 204 1744 1338 3536 2
HUN 110.92 11050 0.080 0.063 193 153 1560 1221 3126 2
LVA 21.04 9774 0.071 0.065 203 154 1687 1044 3087 2
LTU 30.77 9697 0.078 0.069 274 198 2055 858 3385 2
MLT 7.27 17759 0.088 0.066 237 204 1932 994 3367 2
POL 365.43 9586 0.074 0.066 204 197 1969 1035 3405 2
PRT 200.37 19077 0.075 0.052 275 177 1824 1240 3516 2
ROU 125.08 6062 0.062 0.059 200 220 2003 916 3340 2
SVK 70.82 13146 0.072 0.059 140 150 1610 952 2853 2
SVN 39.50 19502 0.066 0.065 230 168 1664 1067 3129 2
ESP 1176.91 26263 0.084 0.059 279 159 1481 1323 3243 2
Table: Cluster 3
avg_gdp gdp_per_person prop_men_diabetes prop_women_diabetes cal_prot_animal cal_prot_plant cal_carbs cal_fat total_consumption cluster
LUX 42.9 87546 0.068 0.039 288 148 1752 1318 3507 3
CHE 490.2 64036 0.050 0.030 237 138 1660 1392 3426 3

Now let’s plot these clusters to see the differences between them.

#> Warning in lda.default(x, grouping, ...): variables are collinear
#>  [1] 3 3 2 2 2 2 3 2 3 3 3 2 2 3 3 2 2 1 2 3 2 2 2 2 2 2 3 1 3

In the graph above,

  • Group 1 = Cluster 3
  • Group 2 = Cluster 2
  • Group 3 = Cluster 1

We can therefore see that diabetes is indeed lower in the clusters that contain the countries with higher GDPs. Let’s see if this relationship is significant within each cluster:

#> [1] 0.368
#> [1] 0.297
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1949 2590 -0.752 0.471
countries_cluster_1$prop_women_diabetes 73396 61776 1.188 0.265
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1823 3150 -0.579 0.577
countries_cluster_1$prop_men_diabetes 48218 51591 0.935 0.374
#> [1] -0.235
#> [1] 0.325
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1114 1048 1.062 0.306
countries_cluster_2$prop_women_diabetes -15283 16888 -0.905 0.381
Estimate Std. Error t value Pr(>|t|)
(Intercept) -901 835 -1.08 0.299
countries_cluster_2$prop_men_diabetes 14397 11208 1.28 0.22
#> [1] -1
#> [1] -1
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1921 NaN NaN NA
countries_cluster_3$prop_women_diabetes -47598 NaN NaN NA
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1698 NaN NaN NA
countries_cluster_3$prop_men_diabetes -24248 NaN NaN NA

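Clustering here doesn’t necessarily help us answer the research question: within clusters 1 and 2 none of the regressions is significant, and cluster 3 contains only two countries, so its correlation is trivially -1 and no standard error can be computed. We can nevertheless confirm with the previous statistical test without outliers that, overall, the GDP of a country and its diabetes rate are negatively correlated.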
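The degenerate output for cluster 3 can be reproduced with any two observations. The sketch below uses the rounded values of Luxembourg and Switzerland from the final dataset, so the coefficients are only close to those shown above; the data frame name cluster_3 is an assumption.

```r
# The two countries of cluster 3, with rounded values from the final dataset.
cluster_3 <- data.frame(
  country_code = c("LUX", "CHE"),
  avg_gdp = c(42.85, 490.18),
  prop_women_diabetes = c(0.039, 0.030)
)

# With two points, the correlation is always exactly 1 or -1.
r <- cor(cluster_3$avg_gdp, cluster_3$prop_women_diabetes)

# The line passes through both points: no residual degrees of freedom remain.
fit <- lm(avg_gdp ~ prop_women_diabetes, data = cluster_3)
coefs <- coef(summary(fit))

print(r)
print(df.residual(fit))
print(coefs)  # standard errors, t values and p values cannot be estimated
```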

However, the clusters defined above could help us answer our other research questions.

2. Do European countries that have higher GDPs consume fewer calories?

As mentioned in the first point, countries with a higher GDP tend to have a lower diabetes rate, which could potentially be explained by the consumption of fewer calories.

But is there a real correlation between these two variables? Let’s check:

#> [1] 0.355
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3017.67 1806.147 -1.67 0.106
GDP_diabetes_cal$total_consumption 1.07 0.546 1.97 0.059 .

Neither the correlation between these two variables (0.355) nor the linear regression (p = 0.059) is significant at the 5% level. However, it is interesting to look further. First, we can see if using the GDP per person instead of the total average GDP makes a difference.

The plot looks more or less the same as the one in the EDA section without the outliers. But is the relationship with this new variable statistically significant ?

#> [1] 0.458
Estimate Std. Error t value Pr(>|t|)
(Intercept) -80976.2 41007.8 -1.98 0.059 .
GDP_diabetes_cal$total_consumption 33.2 12.4 2.68 0.012

We see that the correlation with the variable gdp_per_person is higher (0.458 versus 0.355) and that the regression coefficient is now significant at the 5% level (p = 0.012), although the relationship remains moderate.

Next we can proceed with an analysis within clusters, with the ones defined in the first question.

#> [1] -0.469
Estimate Std. Error t value Pr(>|t|)
(Intercept) 84886.4 27656 3.07 0.013
countries_cluster_1$total_consumption -12.7 8 -1.59 0.146
#> [1] 0.224
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1941.55 18811.00 -0.103 0.919
countries_cluster_2$total_consumption 5.08 5.91 0.860 0.404
#> [1] 1
Estimate Std. Error t value Pr(>|t|)
(Intercept) -939316 NaN NaN NA
countries_cluster_3$total_consumption 293 NaN NaN NA

The results indicate that even within the clusters, these variables are not clearly correlated (cluster 3, with only two countries, again gives a trivial correlation of 1), and the total calories consumed per person in a country is not a significant indicator of this country’s wealth.

When we look at our cluster plot, in terms of total calorie intake it is also the 3rd cluster, which contains the richest countries considering GDP per person, that consumes the most calories. One might therefore think that calorie consumption is not the main reason why high-GDP countries have lower diabetes rates.

3. How do the macronutrients (animal protein/plant protein/carbohydrates/fat) consumed differ between richer and poorer countries?

We observed during the EDA that richer countries seemed to consume more fat on average. We now want to confirm this relationship and check whether the average GDP of a country is also correlated with the consumption of other macronutrients.

#> [1] 0.272
Estimate Std. Error t value Pr(>|t|)
(Intercept) -716.40 860.87 -0.832 0.413
GDP_diabetes_cal$cal_prot_animal 5.28 3.59 1.470 0.153
#> [1] 0.637
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41157 16464.8 -2.5 0.019
GDP_diabetes_cal$cal_prot_animal 295 68.7 4.3 <0.001 ***
#> [1] 0.147
Estimate Std. Error t value Pr(>|t|)
(Intercept) -273 1053 -0.260 0.797
GDP_diabetes_cal$proportion_animal_prot 11260 14598 0.771 0.447
#> [1] 0.536
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41451 21463 -1.93 0.064 .
GDP_diabetes_cal$proportion_animal_prot 981450 297643 3.30 0.003 **

#> [1] 0.0433
Estimate Std. Error t value Pr(>|t|)
(Intercept) 276.48 1135.65 0.243 0.809
GDP_diabetes_cal$cal_prot_plant 1.52 6.75 0.225 0.823
#> [1] -0.357
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78466 25365 3.09 0.005 **
GDP_diabetes_cal$cal_prot_plant -299 151 -1.99 0.057 .
#> [1] -0.186
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1744 1243 1.403 0.172
GDP_diabetes_cal$proportion_plant_prot -23983 24368 -0.984 0.334
#> [1] -0.683
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135049 22065 6.12 <0.001 ***
GDP_diabetes_cal$proportion_plant_prot -2103072 432612 -4.86 <0.001 ***

#> [1] -0.0828
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1212.644 1588.159 0.764 0.452
GDP_diabetes_cal$cal_carbs -0.393 0.911 -0.432 0.669
#> [1] -0.105
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49289.5 37855.7 1.30 0.204
GDP_diabetes_cal$cal_carbs -11.9 21.7 -0.55 0.587
#> [1] -0.437
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4815 1702 2.83 0.009 **
GDP_diabetes_cal$proportion_carbs -8136 3220 -2.53 0.018
#> [1] -0.575
Estimate Std. Error t value Pr(>|t|)
(Intercept) 163177 36974 4.41 <0.001 ***
GDP_diabetes_cal$proportion_carbs -255576 69975 -3.65 0.001 **

#> [1] 0.503
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1941.46 827.385 -2.35 0.027
GDP_diabetes_cal$cal_fat 2.13 0.703 3.03 0.005 **
#> [1] 0.636
Estimate Std. Error t value Pr(>|t|)
(Intercept) -46079.4 17641 -2.61 0.015
GDP_diabetes_cal$cal_fat 64.2 15 4.29 <0.001 ***
#> [1] 0.428
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2271 1145 -1.98 0.058 .
GDP_diabetes_cal$proportion_fat 7972 3235 2.46 0.02
#> [1] 0.544
Estimate Std. Error t value Pr(>|t|)
(Intercept) -56316 25411 -2.22 0.035
GDP_diabetes_cal$proportion_fat 241607 71784 3.37 0.002 **

Among the regressions of the average GDP on the calories consumed per person from each macronutrient, only the one for fat is significant (r = 0.503, p = 0.005). For the other macronutrients, there seems to be no correlation with the average GDP.

Let’s see if considering the GDP per person instead of the total makes a difference.

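With the GDP per person, the calories from animal protein (r = 0.637, p < 0.001) and from fat (r = 0.636, p < 0.001) are both significantly and positively related to wealth, plant protein is borderline (r = -0.357, p = 0.057), and carbohydrates remain unrelated (r = -0.105, p = 0.587).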

Another way to answer this research question is to look at the relationship between the wealth of a country and the proportion of total calories that comes from each macronutrient.
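These proportions can be computed by dividing each macronutrient column by the row total, as in the following base R sketch. It uses the rounded values of Austria and Bulgaria from Table 1 and recomputes the total from the four columns so that the proportions sum exactly to 1; the proportion_* names match the regression output below, while the data frame names are assumptions.

```r
# Rounded values for two countries, copied from Table 1.
calories <- data.frame(
  country_code = c("AUT", "BGR"),
  cal_prot_animal = c(245, 155),
  cal_prot_plant  = c(169, 168),
  cal_carbs       = c(1833, 1606),
  cal_fat         = c(1454, 846)
)

macro_cols <- c("cal_prot_animal", "cal_prot_plant", "cal_carbs", "cal_fat")

# Total recomputed from the four columns, one value per country.
total <- rowSums(calories[macro_cols])

# Share of the total calories coming from each macronutrient.
proportions <- calories[macro_cols] / total
names(proportions) <- c("proportion_animal_prot", "proportion_plant_prot",
                        "proportion_carbs", "proportion_fat")

calories_prop <- cbind(calories["country_code"], proportions)
print(round(calories_prop[-1], 3))
```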

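With proportions instead of calorie counts, the correlations become stronger for plant protein and carbohydrates but weaker for animal protein and fat. With the average GDP, the proportions of carbohydrates (r = -0.437, p = 0.018) and fat (r = 0.428, p = 0.02) are significant, whereas the proportions of animal protein (r = 0.147, p = 0.447) and plant protein (r = -0.186, p = 0.334) are not.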

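With the GDP per person, all four proportions are significantly related to wealth: positively for animal protein (r = 0.536, p = 0.003) and fat (r = 0.544, p = 0.002), negatively for plant protein (r = -0.683, p < 0.001) and carbohydrates (r = -0.575, p = 0.001).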

#> Warning in lda.default(x, grouping, ...): variables are collinear
#>  [1] 3 3 2 2 2 2 3 2 3 3 3 2 2 3 3 2 2 1 2 3 2 2 2 2 2 2 3 1 3
#> [1] 0.566
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17379 11527 1.51 0.166
countries_cluster_1bis$proportion_animal_prot 308367 149729 2.06 0.07 .
#> [1] -0.351
Estimate Std. Error t value Pr(>|t|)
(Intercept) 61779 18586 3.32 0.009 **
countries_cluster_1bis$proportion_plant_prot -438724 389755 -1.13 0.289
#> [1] 0.553
Estimate Std. Error t value Pr(>|t|)
(Intercept) -28376 34846 -0.814 0.436
countries_cluster_1bis$proportion_carbs 137918 69277 1.991 0.078 .
#> [1] -0.695
Estimate Std. Error t value Pr(>|t|)
(Intercept) 101564 20955 4.85 0.001 ***
countries_cluster_1bis$proportion_fat -162308 56003 -2.90 0.018

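Within cluster 1, only the proportion of fat is significantly related to wealth, and negatively so (r = -0.695, p = 0.018). The proportions of animal protein (r = 0.566, p = 0.07) and carbohydrates (r = 0.553, p = 0.078) are borderline, and the proportion of plant protein is not significant (r = -0.351, p = 0.289).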

#> [1] 0.676
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13068 8011 -1.63 0.125
countries_cluster_2bis$proportion_animal_prot 404665 117759 3.44 0.004 **
#> [1] -0.417
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38383 14156 2.71 0.017
countries_cluster_2bis$proportion_plant_prot -448628 261105 -1.72 0.108
#> [1] -0.751
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66623 12383 5.38 <0.001 ***
countries_cluster_2bis$proportion_carbs -95782 22540 -4.25 0.001 ***
#> [1] 0.659
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13799 8617 -1.60 0.132
countries_cluster_2bis$proportion_fat 84480 25780 3.28 0.006 **

analysis cluster 2

#> [1] 1
Estimate Std. Error t value Pr(>|t|)
(Intercept) -61500 NaN NaN NA
countries_cluster_3bis$proportion_animal_prot 1812514 NaN NaN NA
#> [1] 1
Estimate Std. Error t value Pr(>|t|)
(Intercept) -398204 NaN NaN NA
countries_cluster_3bis$proportion_plant_prot 11498699 NaN NaN NA
#> [1] 1
Estimate Std. Error t value Pr(>|t|)
(Intercept) -678064 NaN NaN NA
countries_cluster_3bis$proportion_carbs 1532053 NaN NaN NA
#> [1] -1
Estimate Std. Error t value Pr(>|t|)
(Intercept) 378544 NaN NaN NA
countries_cluster_3bis$proportion_fat -774350 NaN NaN NA

analysis cluster 3
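
The cluster 3 output above (correlations of exactly 1 or -1, and NaN standard errors) is what a simple regression produces when it is fitted on only two observations: a straight line passes through both points exactly, so no residual degrees of freedom remain to estimate the error. The Python sketch below illustrates this with two hypothetical countries (the values are invented for illustration and are not taken from the datasets; the report itself was produced in R).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residual_df(n_obs, n_params=2):
    """Residual degrees of freedom of a regression with intercept and one slope."""
    return n_obs - n_params

# Hypothetical two-country cluster (illustrative values only).
fat_share = [0.38, 0.40]
gdp_per_person = [80000.0, 65000.0]

r = pearson_r(fat_share, gdp_per_person)
df = residual_df(len(fat_share))
print(r, df)  # correlation of magnitude 1, zero residual degrees of freedom
```

With zero residual degrees of freedom the standard errors, t values and p-values are undefined, so the cluster 3 regressions cannot be interpreted on their own.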

Oddly enough, it seems that a higher consumption of animal protein could be related to the rate of diabetes, which would be at odds with M. Adeva-Andany’s (2019) article “Dietary habits contribute to define the risk of type 2 diabetes in humans”.

4. How do these differences in macronutrients relate to the diabetes prevalence in these countries ? What is the typical diet that can be observed in richer states that relates to lower diabetes prevalence ?

What we observed in the EDA was unexpected: we anticipated a positive relationship between total calories (or calories consumed from fat) and diabetes prevalence in a country. Let’s now check whether the calories consumed from each macronutrient are correlated with the diabetes rate.

#> [1] -0.246
#> [1] -0.515
#> [1] 0.227
#> [1] 0.387
#> [1] -0.0322
#> [1] 0.155
#> [1] -0.404
#> [1] -0.697

The correlations are fairly strong for fat calories and animal protein, and moderate for plant protein. Carbs, on the other hand, show almost no correlation with the rate of diabetes. Furthermore, it is interesting to note that the correlation is always stronger (in absolute value) for women. Perhaps malnutrition raises the risk of diabetes in women more than in men? We will continue this analysis by running linear regressions.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.081 0.01 7.94 <0.001 ***
GDP_diabetes_cal$cal_prot_animal 0.000 0.00 -1.32 0.198
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.087 0.011 7.73 <0.001 ***
GDP_diabetes_cal$cal_prot_animal 0.000 0.000 -3.12 0.004 **
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.052 0.013 4.00 <0.001 ***
GDP_diabetes_cal$cal_prot_plant 0.000 0.000 1.21 0.237
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.019 0.015 1.25 0.222
GDP_diabetes_cal$cal_prot_plant 0.000 0.000 2.18 0.038 *
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.071 0.019 3.780 0.001 ***
GDP_diabetes_cal$cal_carbs 0.000 0.000 -0.167 0.868
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.034 0.023 1.457 0.157
GDP_diabetes_cal$cal_carbs 0.000 0.000 0.816 0.422
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.092 0.01 8.84 <0.001 ***
GDP_diabetes_cal$cal_fat 0.000 0.00 -2.29 0.03 *
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.103 0.01 10.19 <0.001 ***
GDP_diabetes_cal$cal_fat 0.000 0.00 -5.05 <0.001 ***

At the 1% level, only animal protein and fat are significant, and only for women; plant protein for women (p = 0.038) and fat for men (p = 0.03) are significant at the 5% level only. This is consistent with the view that malnutrition has a greater impact on the rate of diabetes in women than in men. We will now use the proportions to get a better idea of the relationship between macronutrients and diabetes rates.
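
In a regression with a single predictor, the t value of the slope is fully determined by the Pearson correlation and the sample size, so the correlations and the regression tables above can be cross-checked against each other. The Python sketch below does this for the two fat-calorie correlations (-0.404 and -0.697), assuming a sample of 29 countries, which is what the 29 group labels printed earlier in the report suggest (the report itself was produced in R; the function name is ours).

```python
import math

def slope_t_from_r(r, n):
    """t statistic of the slope in a one-predictor linear regression,
    recovered from the Pearson correlation r and the sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

N_COUNTRIES = 29  # assumption: inferred from the 29 labels printed earlier

# Correlations between calories from fat and diabetes rate, as printed above.
t_fat_first = slope_t_from_r(-0.404, N_COUNTRIES)
t_fat_second = slope_t_from_r(-0.697, N_COUNTRIES)

print(round(t_fat_first, 2), round(t_fat_second, 2))
```

The two values agree with the t values of -2.29 and -5.05 reported for cal_fat, which supports reading the printed correlations and the regression tables in the same order.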

#> [1] -0.141
#> [1] -0.399
#> [1] 0.452
#> [1] 0.706
#> [1] 0.272
#> [1] 0.616
#> [1] -0.303
#> [1] -0.621
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.077 0.012 6.193 <0.001 ***
GDP_diabetes_cal$proportion_animal_prot -0.128 0.173 -0.739 0.466
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.084 0.014 5.89 <0.001 ***
GDP_diabetes_cal$proportion_animal_prot -0.448 0.198 -2.26 0.032 *
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.033 0.013 2.49 0.019 *
GDP_diabetes_cal$proportion_plant_prot 0.689 0.262 2.63 0.014 *
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.015 0.013 -1.15 0.26
GDP_diabetes_cal$proportion_plant_prot 1.333 0.258 5.17 <0.001 ***
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.037 0.022 1.70 0.101
GDP_diabetes_cal$proportion_carbs 0.060 0.041 1.47 0.154
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.036 0.022 -1.65 0.11
GDP_diabetes_cal$proportion_carbs 0.168 0.041 4.06 <0.001 ***
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.091 0.014 6.40 <0.001 ***
GDP_diabetes_cal$proportion_fat -0.067 0.040 -1.65 0.11
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.112 0.015 7.68 <0.001 ***
GDP_diabetes_cal$proportion_fat -0.169 0.041 -4.12 <0.001 ***

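All macronutrients except animal protein (p = 0.032, significant only at the 5% level) are highly significant for women, whereas for men only plant protein is significant (p = 0.014). This supports our hypothesis that malnutrition has a greater impact on the rate of diabetes in women than in men.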

We now investigate which calorie consumption patterns could be related to a low diabetes rate in each cluster. To see the pattern of each cluster, we average each variable over the countries in that cluster.

Cluster means (one row per cluster):

cluster  avg_gdp  gdp_per_person  prop_men_diabetes  prop_women_diabetes  cal_prot_animal  cal_prot_plant  cal_carbs  cal_fat  total_consumption
1           1106           40938              0.061                0.042              262             164       1736     1292               3454
2            167           14181              0.074                0.062              215             172       1738     1049               3174
3            267           75791              0.059                0.035              263             143       1706     1355               3466

As we see from the cluster averages, cluster 2 has the highest rate of diabetes, followed by cluster 1, and finally cluster 3 has the lowest rate.

There is not much difference between the dietary patterns of the clusters. However, carbs represent 53% of the diet of cluster 2, and the proportion of carbs is strongly associated with the rate of diabetes, particularly among women. In the regression on proportions, the coefficient of 0.168 means that a 10-percentage-point increase in the share of carbs is associated with a female diabetes prevalence about 1.7 percentage points higher, with a p-value significant at the 1% level. Keeping the share of carbs below half of total calories might therefore be associated with a lower diabetes rate, although these regressions show association, not causation.
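
To make the interpretation of the carbs coefficient concrete, the Python sketch below plugs the estimates printed above for proportion_carbs (intercept -0.036, slope 0.168, which we assume to be the regression for women) into the fitted line. The function names are ours, and the report itself was produced in R.

```python
# Estimates copied from the regression output above
# (assumed to be the women's diabetes rate regressed on proportion_carbs).
INTERCEPT = -0.036
SLOPE = 0.168

def predicted_prevalence(carb_share):
    """Fitted diabetes prevalence (as a proportion) for a given share of carbs."""
    return INTERCEPT + SLOPE * carb_share

def prevalence_change(delta_share):
    """Change in fitted prevalence when the carb share moves by delta_share."""
    return SLOPE * delta_share

# A 10-percentage-point increase in the share of carbs.
print(round(prevalence_change(0.10), 4))
# Fitted prevalence when carbs make up 53% of the diet.
print(round(predicted_prevalence(0.53), 4))
```

Because the predictor is a proportion between 0 and 1, the slope describes the effect of moving from 0% to 100% carbs; realistic changes of a few percentage points therefore correspond to much smaller changes in fitted prevalence.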

Strangely enough, the proportion of fat is lowest in cluster 2 (33%), which has the highest diabetes rate. This is counter-intuitive, but it is consistent with the negative correlations between fat and diabetes rate calculated above. It is therefore difficult to draw a conclusion regarding the proportion of fat that should be consumed to reduce the diabetes rate. Our hypothesis is that even though cluster 1 consumes more fat, it is composed of rich countries that have more prevention measures and probably better hospital infrastructure.
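
The macronutrient shares discussed here can be recomputed from the cluster means. The Python sketch below does so, assuming that the three rows of means are printed in the order cluster 1, 2, 3 (the report itself was produced in R). Note that a share computed from cluster means is not necessarily equal to the average of the per-country proportions, so small differences with the percentages quoted in the text are possible.

```python
# Mean calories per macronutrient, copied from the cluster means above.
# Assumption: the rows appear in the order cluster 1, 2, 3.
cluster_means = {
    "cluster 1": {"cal_prot_animal": 262, "cal_prot_plant": 164,
                  "cal_carbs": 1736, "cal_fat": 1292},
    "cluster 2": {"cal_prot_animal": 215, "cal_prot_plant": 172,
                  "cal_carbs": 1738, "cal_fat": 1049},
    "cluster 3": {"cal_prot_animal": 263, "cal_prot_plant": 143,
                  "cal_carbs": 1706, "cal_fat": 1355},
}

def macro_shares(calories):
    """Share of total calories contributed by each macronutrient."""
    total = sum(calories.values())
    return {name: value / total for name, value in calories.items()}

shares = {cluster: macro_shares(cals) for cluster, cals in cluster_means.items()}

for cluster, cluster_shares in shares.items():
    print(cluster, {name: round(share, 3) for name, share in cluster_shares.items()})
```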

Conclusion